Marketing Campaign Analysis¶

Problem Definition¶

The Context:¶

  • We at Tenenbaum Data Consultants have been tasked with conducting exploratory data analysis on a customer dataset, which includs both customer behavior and demographic information.

The objective:¶

  • Given the budgetary limitations of the business, determine how customer segmentation can optimize return on investment in the form of added revenue from sales.

The key questions:¶

  • Are there segments of the customer base more responsive to marketing campaigns?
  • Are there segments of the customer base more responsive to discounts? and for both of the questions:
  • What are the characteristics of these customers and through what medium can they be reached?

The problem formulation:¶

  • Maximize profits for an investment of a marketing and discount budget by using data to identify clusters of customers for marketing.

Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruits products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Import the necessary libraries and load the data¶

In [135]:
!pip install scikit-learn-extra
Requirement already satisfied: scikit-learn-extra in /usr/local/lib/python3.10/dist-packages (0.3.0)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.25.2)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.11.4)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (3.4.0)
In [136]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# Importing PCA and t-SNE
from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

# Importing clustering algorithms
from sklearn.cluster import KMeans

from sklearn.mixture import GaussianMixture

from sklearn_extra.cluster import KMedoids

from sklearn.cluster import AgglomerativeClustering

from sklearn.cluster import DBSCAN

# Silhouette score
from sklearn.metrics import silhouette_score

from statistics import mode

import warnings
warnings.filterwarnings("ignore")

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
  • Drop the column which has no null values
In [137]:
data=pd.read_csv("https://github.com/jesusina/clustering_PCA/blob/main/marketing_campaign.csv?raw=true")
In [138]:
data.head()
Out[138]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 27 columns

In [139]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB

We have 24 missing values for income. We'll fill them in with normally distributed values matching the dataset, using median instead of mean to protect against outliers.

In [140]:
meanIncome, stdIncome=data['Income'].median(), data['Income'].std()
indices_to_be_updated=data[data['Income'].isnull()].index
data.loc[indices_to_be_updated, 'Income']=np.random.normal(meanIncome, stdIncome, len(indices_to_be_updated))
In [141]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2240 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB

Let's check for duplicates:

In [142]:
len(data['ID'].unique())
Out[142]:
2240
  1. Find out number of unique observations in each category of categorical columns? Write your findings/observations/insights
In [143]:
data['Marital_Status'].value_counts()
Out[143]:
Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64
In [144]:
data['Education'].value_counts()
Out[144]:
Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: Education, dtype: int64

Attempting to interpret these categories.

In [145]:
data.groupby('Education')['Income'].mean()
Out[145]:
Education
2n Cycle      47721.263842
Basic         20306.259259
Graduation    52743.122172
Master        52782.396741
PhD           56072.733886
Name: Income, dtype: float64
In [146]:
sns.boxplot(data, x='Education', y='Income')
Out[146]:
<Axes: xlabel='Education', ylabel='Income'>

We have a single outlier in income, roughly 3 times that of any other data points. Who is this person?

In [147]:
data[data['Income']>200000]
Out[147]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
2233 9432 1977 Graduation Together 666666.0 1 0 02-06-2013 23 9 ... 1 3 6 0 0 0 0 0 0 0

1 rows × 27 columns

This person was likely not reporting their income seriously. Because this person has other data, let's keep this person, but change their income to the average for their "Graduation" education level.

In [148]:
data.at[2233,'Income']=data.groupby('Education')['Income'].median()['Graduation']
In [149]:
sns.boxplot(data, x='Education', y='Income', hue='Marital_Status')
Out[149]:
<Axes: xlabel='Education', ylabel='Income'>
In [150]:
data['Dt_Customer'].value_counts()
Out[150]:
31-08-2012    12
12-09-2012    11
14-02-2013    11
12-05-2014    11
20-08-2013    10
              ..
05-08-2012     1
18-11-2012     1
25-05-2013     1
14-04-2013     1
09-01-2014     1
Name: Dt_Customer, Length: 663, dtype: int64

Observations and Insights from the Data overview:¶

From the context of these columns, we are dealing with a grocery story with an online presence, trying to determine how to spend its marketing budget to maximize profit.

There are 2240 unique customers and no missing values in the dataset. Not counting customer ID, we have 26 columns, of which 23 are numerical. Nine of these numerical columns are dummy variables. The remaining 3 non-numerical columns include a date string (Dt_Customer), education, and marital status. These need to be treated carefully.

Dt_Customer: There are 663 unique dates on which our 2240 customers were acquired.

Education: Based on univariate analysis in income, "Basic" here seems to indicate less than a bachelors degree. "2ncycle" is the European equivalent of masters degree, though it may be worthwhile to keep around as a distinct category.

Marital Status: the responses given here indicate customers were permitted to fill out an "Other" category with a customer string response. The author of this notebook would advise the data engineers to modify this to only permit the choices: married, together, single, divorced, widow, since "alone" should be taken as single and the remaining answers of "absurd" and "YOLO" are obviously not useful. We'll deal with the treatment of these below.

Converting "Alone" to "Single":

In [151]:
data['Marital_Status']=data['Marital_Status'].apply(lambda x: 'Single' if  x=='Alone' else x)
data['Marital_Status'].value_counts()
Out[151]:
Married     864
Together    580
Single      483
Divorced    232
Widow        77
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64

Data Prepreprocessing/Exploratory Data Analysis¶

Let's clean up the date column

In [152]:
data['Dt_Customer']=pd.to_datetime(data['Dt_Customer'],dayfirst=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ID                   2240 non-null   int64         
 1   Year_Birth           2240 non-null   int64         
 2   Education            2240 non-null   object        
 3   Marital_Status       2240 non-null   object        
 4   Income               2240 non-null   float64       
 5   Kidhome              2240 non-null   int64         
 6   Teenhome             2240 non-null   int64         
 7   Dt_Customer          2240 non-null   datetime64[ns]
 8   Recency              2240 non-null   int64         
 9   MntWines             2240 non-null   int64         
 10  MntFruits            2240 non-null   int64         
 11  MntMeatProducts      2240 non-null   int64         
 12  MntFishProducts      2240 non-null   int64         
 13  MntSweetProducts     2240 non-null   int64         
 14  MntGoldProds         2240 non-null   int64         
 15  NumDealsPurchases    2240 non-null   int64         
 16  NumWebPurchases      2240 non-null   int64         
 17  NumCatalogPurchases  2240 non-null   int64         
 18  NumStorePurchases    2240 non-null   int64         
 19  NumWebVisitsMonth    2240 non-null   int64         
 20  AcceptedCmp3         2240 non-null   int64         
 21  AcceptedCmp4         2240 non-null   int64         
 22  AcceptedCmp5         2240 non-null   int64         
 23  AcceptedCmp1         2240 non-null   int64         
 24  AcceptedCmp2         2240 non-null   int64         
 25  Complain             2240 non-null   int64         
 26  Response             2240 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(23), object(2)
memory usage: 472.6+ KB
In [153]:
data['Year'], data['Month'], data['Day']=data['Dt_Customer'].dt.year, data['Dt_Customer'].dt.month, data['Dt_Customer'].dt.day

data=data.drop(['Dt_Customer'],axis=1)
In [154]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2240 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Recency              2240 non-null   int64  
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumDealsPurchases    2240 non-null   int64  
 15  NumWebPurchases      2240 non-null   int64  
 16  NumCatalogPurchases  2240 non-null   int64  
 17  NumStorePurchases    2240 non-null   int64  
 18  NumWebVisitsMonth    2240 non-null   int64  
 19  AcceptedCmp3         2240 non-null   int64  
 20  AcceptedCmp4         2240 non-null   int64  
 21  AcceptedCmp5         2240 non-null   int64  
 22  AcceptedCmp1         2240 non-null   int64  
 23  AcceptedCmp2         2240 non-null   int64  
 24  Complain             2240 non-null   int64  
 25  Response             2240 non-null   int64  
 26  Year                 2240 non-null   int64  
 27  Month                2240 non-null   int64  
 28  Day                  2240 non-null   int64  
dtypes: float64(1), int64(26), object(2)
memory usage: 507.6+ KB
In [155]:
fig = plt.figure(figsize = (18, 6))

sns.heatmap(data.corr(), annot = True);

plt.xticks(rotation = 45);

Month and day of joining have minimal correlations to our other variables, excepting an interesting connection between month and year.

In [156]:
import plotly.express as px
In [157]:
px.box(data, x='Year', y='Month', points="all")

The negative correlation is purely an artifact of the time window being considered (July 2012-June 2014). Let's drop month and day.

In [158]:
data=data.drop(['Month', 'Day'],axis=1)

Let's look at our YOLO and Absurd Marital_Status points. We note that Teenhome is the strongest correlate to birth year.

In [159]:
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Kidhome', col='Teenhome')
Out[159]:
<seaborn.axisgrid.FacetGrid at 0x7a25148a2230>

Exploratory Data Analysis doesn't give much in the way of conclusions for YOLO and "Absurd" marital status. We'll assign them to be married, as the most common marital status. Also, we note outliers in Year_Birth, which we'll assign to the average Year_Birth

In [160]:
data['Marital_Status']=data['Marital_Status'].apply(lambda x: 'Married' if  x=='YOLO' or x=="Absurd" else x)
data['Year_Birth']=data['Year_Birth'].apply(lambda x: data['Year_Birth'].mean() if  x<1920 else x)
#data['Marital_Status']=data['Marital_Status'].apply(lambda x: data.groupby('Year_Birth')['Marital_Status'].mode() if  x=='YOLO' or x=="Absurd" else x)
#data.groupby('Teenhome')['Year_Birth'].mean()[1]
#data[data['Year_Birth']<1920]['Year_Birth']
#data['Marital_Status']=data['Marital_Status'].apply(lambda x: 'Single' if  x=='Alone' else x)
#data['Year_Birth']=data['Year_Birth'].apply(lambda x: data.groupby('Teenhome')['Year_Birth'].mean() if  x<1920 else x)
#data[data['Year_Birth']<1920]['Year_Birth']=data.groupby('Kidhome')['Year_Birth'].mean()[]
In [161]:
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Kidhome', col='Education')
Out[161]:
<seaborn.axisgrid.FacetGrid at 0x7a2510472140>

Dropping customer ID, which has been confirmed to be uncorrelated to any of our other variables.

In [162]:
data=data.drop(['ID'], axis=1)
In [163]:
data_dums=pd.get_dummies(data, drop_first=True)
data_dums.describe().T
Out[163]:
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.901526 11.694076 1940.0 1959.00 1970.0 1977.00 1996.0
Income 2240.0 51960.598085 21523.288578 1730.0 35303.00 51381.5 68468.25 162397.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0
Year 2240.0 2013.028125 0.684554 2012.0 2013.00 2013.0 2013.00 2014.0
Education_Basic 2240.0 0.024107 0.153416 0.0 0.00 0.0 0.00 1.0
Education_Graduation 2240.0 0.503125 0.500102 0.0 0.00 1.0 1.00 1.0
Education_Master 2240.0 0.165179 0.371425 0.0 0.00 0.0 0.00 1.0
Education_PhD 2240.0 0.216964 0.412270 0.0 0.00 0.0 0.00 1.0
Marital_Status_Married 2240.0 0.387500 0.487288 0.0 0.00 0.0 1.00 1.0
Marital_Status_Single 2240.0 0.215625 0.411347 0.0 0.00 0.0 0.00 1.0
Marital_Status_Together 2240.0 0.258929 0.438144 0.0 0.00 0.0 1.00 1.0
Marital_Status_Widow 2240.0 0.034375 0.182231 0.0 0.00 0.0 0.00 1.0

Questions:¶

1. What is the summary statistics of the data? Explore summary statistics for numerical variables and the categorical variables

We note a large number of dummy variable columns, where the value is equal to 0 or 1, where the mean is simply the percentage of people having that attribute (e.g. 0.9% of correspondents had complained).

2. Find out number of unique observations in each category of categorical columns? Write your findings/observations/insights

In [164]:
data['Marital_Status'].value_counts(), data['Marital_Status'].value_counts()/len(data)
Out[164]:
(Married     868
 Together    580
 Single      483
 Divorced    232
 Widow        77
 Name: Marital_Status, dtype: int64,
 Married     0.387500
 Together    0.258929
 Single      0.215625
 Divorced    0.103571
 Widow       0.034375
 Name: Marital_Status, dtype: float64)
In [165]:
data['Education'].value_counts(), data['Education'].value_counts()/len(data)
Out[165]:
(Graduation    1127
 PhD            486
 Master         370
 2n Cycle       203
 Basic           54
 Name: Education, dtype: int64,
 Graduation    0.503125
 PhD           0.216964
 Master        0.165179
 2n Cycle      0.090625
 Basic         0.024107
 Name: Education, dtype: float64)

The largest segment of our customer based (868 of them, 39%) are married, followed by "Together" (580, 26%) and Single (483, 22%).

The largest segment of our customer base (1127, 50%) had graduated high school, while 486 of them (22%) had PhDs.

3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?

Though 2n Cycle is essentially the European equivalent of a Masters, we leave the categories as distinct, in the event that they yield some predictive power, if nothing else than as a stand-in for country.

4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?

Yes, we have treated this condition. "Alone" was changed to "Single" while "Absurd" and "YOLO" were converted to "Married" as the modal Marital_Status.

Univariate Analysis on Numerical and Categorical data¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

  • Plot histogram and box plot for different numerical features and understand how the data looks like. -Explore the categorical variables like Education, Kidhome, Teenhome, Complain: About 1% of customers complained. For KidHome and Teenhome, the median customer had none. Education has been explored above.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. How does the distribution of Income variable vary across the dataset?
  2. The histogram and the box plot are showing some extreme value on the right side of the distribution of the 'Income' feature. Can we consider them as outliers and remove or should we analyze these extreme values?
  1. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? At what percentile the upper whisker lies?
In [166]:
data[data['Income']>120000]
Out[166]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts ... NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Year
164 1973.0 PhD Married 157243.0 0 1 98 20 2 1582 ... 0 0 0 0 0 0 0 0 0 2014
617 1976.0 PhD Together 162397.0 1 1 31 85 1 16 ... 1 1 0 0 0 0 0 0 0 2013
655 1975.0 Graduation Divorced 153924.0 0 0 81 1 1 1 ... 0 0 0 0 0 0 0 0 0 2014
687 1982.0 PhD Married 160803.0 0 0 21 55 16 1622 ... 1 0 0 0 0 0 0 0 0 2012
1300 1971.0 Master Together 157733.0 1 0 37 39 1 9 ... 1 1 0 0 0 0 0 0 0 2013
1653 1977.0 Graduation Together 157146.0 0 0 13 1 0 1725 ... 0 1 0 0 0 0 0 0 0 2013
2132 1949.0 PhD Married 156924.0 0 0 85 2 1 2 ... 0 0 0 0 0 0 0 0 0 2013

7 rows × 26 columns

These 7 outliers represent 0.3% of the data. As such, the right whisker is at the 99.7th percentile. We elect to replace these incomes with incomes normally distributed around the median.

In [167]:
indices_to_be_updated=data[data['Income']>120000].index
data.loc[indices_to_be_updated, 'Income']=np.random.normal(meanIncome, stdIncome, len(indices_to_be_updated))
In [168]:
numerical_cols = [col for col in data.columns if pd.api.types.is_numeric_dtype(data[col])]
for col in numerical_cols:
    print(col)

    print('Skew :', round(data[col].skew(), 2))

    plt.figure(figsize = (15, 4))

    plt.subplot(1, 2, 1)

    data[col].hist(bins = 10, grid = False)

    plt.ylabel('count')

    plt.subplot(1, 2, 2)

    sns.boxplot(x = data[col])

    plt.show()
Year_Birth
Skew : -0.09
Income
Skew : 0.01
Kidhome
Skew : 0.64
Teenhome
Skew : 0.41
Recency
Skew : -0.0
MntWines
Skew : 1.18
MntFruits
Skew : 2.1
MntMeatProducts
Skew : 2.08
MntFishProducts
Skew : 1.92
MntSweetProducts
Skew : 2.14
MntGoldProds
Skew : 1.89
NumDealsPurchases
Skew : 2.42
NumWebPurchases
Skew : 1.38
NumCatalogPurchases
Skew : 1.88
NumStorePurchases
Skew : 0.7
NumWebVisitsMonth
Skew : 0.21
AcceptedCmp3
Skew : 3.29
AcceptedCmp4
Skew : 3.24
AcceptedCmp5
Skew : 3.29
AcceptedCmp1
Skew : 3.56
AcceptedCmp2
Skew : 8.62
Complain
Skew : 10.19
Response
Skew : 1.97
Year
Skew : -0.04

We note the following about the various columns:

  • Birth_Year skews slightly right, as our customers skew younger
  • Income having shaved our outliers, Income is now normally distributed
  • Recency of last purchase shows remarkable uniformity in its distribution.
  • Mnts of various food products all skew strongly right in such a systematic way that the outliers need to be retained.
  • Kidhome and Teenhome both have median values of 0, in that most customers have no dependents living at home.
  • Purchases distributions all skew right.

Bivariate Analysis¶

  • Analyze different categorical and numerical variables and check how different variables are related to each other.
  • Check the relationship of numerical variables with categorical variables.
In [169]:
fig = plt.figure(figsize = (28, 12))

sns.heatmap(data_dums.corr(), annot = True);

plt.xticks(rotation = 70);
In [170]:
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Education')
Out[170]:
<seaborn.axisgrid.FacetGrid at 0x7a251605f8b0>
In [171]:
sns.boxplot(data, y='Year_Birth', x='Marital_Status', hue='Education')
Out[171]:
<Axes: xlabel='Marital_Status', ylabel='Year_Birth'>

Key observations are that PhDs and Masters tend to be older and Basic education tends to be younger. The 2n Cycles seem consistently a bit older than their masters counterparts, which means keeping them as separate categories is worthwhie.

In [172]:
sns.boxplot(data, y='Income', x='Education', hue='Marital_Status')
Out[172]:
<Axes: xlabel='Education', ylabel='Income'>

The only notable difference in education and income is for Basic education to earn a lot less. We also note that Widows earn more, but then again they are older so this is expected (corr (Income, Year_Birth)=-0.2

Let's see who does the shopping;

In [173]:
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data, y='Income', x='NumWebPurchases')
plt.subplot(1, 2, 2)
sns.boxplot(data, y='Income', x='NumStorePurchases')
plt.show()
In [174]:
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data, x='NumWebVisitsMonth', y='Income')
plt.subplot(1, 2, 2)
sns.scatterplot(data, y='NumWebVisitsMonth', x='Income', hue="Education")
plt.tight_layout()
plt.show()

These two graphs tell interesting and complementary stories. The first says that the fewer number web visits per month, the higher income the person while the extreme right skew of customers making a lot of visits earn a lot less. The graph at right shows this as well, that only the very lowest income customers visit the website more than 10 times per month. This seems to contradict that the highest income individuals made the most online purchases.

In [175]:
sns.jointplot(data, x='NumWebVisitsMonth', y='NumWebPurchases' ,hue="Education");

Unfortunately, correlation is not robust to outliers, so these 9 people might be constituting the entirety of the -0.52 correlation. Let's check this.

In [176]:
NumWebPurchases_no_outliers=data[data['NumWebPurchases']<15]['NumWebPurchases']
NumWebVisitsMonth_no_outliers=data[data['NumWebVisitsMonth']<15]['NumWebVisitsMonth']
print(NumWebPurchases_no_outliers.corr(NumWebVisitsMonth_no_outliers))
sns.scatterplot(x=NumWebVisitsMonth_no_outliers, y=NumWebPurchases_no_outliers)
-0.010576900440938886
Out[176]:
<Axes: xlabel='NumWebVisitsMonth', ylabel='NumWebPurchases'>

The heatmap above is a bit unwieldy, but it may prove useful to look at two subsets of it:

  • the various amounts purchased
  • the nth responses to campaigns
In [177]:
cols_with_Mnt = [col for col in data.columns if "Mnt" in col]
cols_with_Mnt_Edu = [col for col in data.columns if "Mnt" in col or 'Education' in col]
cols_with_accepted = [col for col in data.columns if "Accepted" in col or "Response" in col]
print(cols_with_Mnt)
print(cols_with_accepted)
['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response']
In [178]:
fig = plt.figure(figsize = (18, 6))

sns.heatmap(data[cols_with_Mnt].corr(), annot = True);

plt.xticks(rotation = 45);
In [179]:
sns.pairplot(data[cols_with_Mnt_Edu], corner=True, hue='Education')
Out[179]:
<seaborn.axisgrid.PairGrid at 0x7a2514d84d90>
In [180]:
fig = plt.figure(figsize = (18, 6))

sns.heatmap(data[cols_with_accepted].corr(), annot = True);

plt.xticks(rotation = 45);

Not a whole lot of information other than responding to the last campaign was modestly correlated with responding to any prior campaign, thooguh the third campaign also seemed oddly unique in its strategy for not being correlated to other campaigns.

In [181]:
means=data[cols_with_accepted].mean()
plt.xticks(rotation = 45);
plt.ylabel("Success rate")
sns.barplot(means)
Out[181]:
<Axes: ylabel='Success rate'>

Campaign 2 was the least successful. The last campaign was the most successful. Let's break this down by demographic.

In [182]:
import numpy
In [183]:
for col in cols_with_accepted:

  print(col)
  demo_data=data.pivot_table(index='Marital_Status', columns='Education', values=col)
  plt.figure(figsize = (5, 5)) # To resize the plot
  sns.heatmap(demo_data, cmap = 'viridis', annot = True)
  plt.show()
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Response
In [184]:
demo_data=data.pivot_table(index='Marital_Status', columns='Education', values='NumDealsPurchases')
plt.figure(figsize = (5, 5)) # To resize the plot
sns.heatmap(demo_data, cmap = 'viridis', annot = True)
plt.show()
  • First campaign: was most successful with 2n cycle widows and masters widows
  • Second campaign: divorced 2n cycles, but still low success overall
  • Third campaign: cohabitating customers with basic education
  • Fourth and fifth campaigns: masters widows
  • Last campaign: we note that this was the most successful campaign overall and it was most successful with masters widows; divorced, single, and widowed PhDs; but decently successful with divorced and single people in general.
  • Discounts were most effective for divorced people and least effective for people with a Basic education. Notably, the Masters students had a higher number of discounted purchases than the 2N cycles.

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Imputing missing values Already completed above.

Think About It:

  • Can we extract the age of each customer and create a new feature?
  • Can we find the total kids and teens in the home?
  • Can we find out how many members each family has?
  • Can we find the total amount spent by the customers on various products?
  • Can we find out how long the customer has been with the company?
  • Can we find out how many offers the customers have accepted?
  • Can we find out amount spent per purchase?
In [185]:
new_cols=[]
In [186]:
#current customer age
data['Age']=2016-data['Year_Birth']
In [187]:
new_cols.append('Age')#keeping a running list of my new cols for EDA
In [188]:
data.drop('Year_Birth', axis=1, inplace=True)
In [189]:
#total kids
data['TotDependents']=data['Kidhome']+data['Teenhome']
new_cols.append('TotDependents')
#family size
data['Adults']=data.apply(lambda x: 1 if x['Marital_Status'] in ['Single ', 'Widow', 'Divorced'] else 2, axis=1)
data['FamilySize']=data['TotDependents']+data['Adults']
new_cols.append('FamilySize')
#total spent
data['TotSpent']=data[cols_with_Mnt].sum(axis=1)
new_cols.append('TotSpent')
#customer longevity
data['Longevity']=2016-data['Year']
new_cols.append('Longevity')
#total offers accepted
#data[cols_with_accepted].corr()>0.5
In [190]:
print(cols_with_accepted)
['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response']
In [191]:
data['TotAccepted']=data[cols_with_accepted].sum(axis=1)
new_cols.append('TotAccepted')
In [192]:
#Total number purchases
data['TotNumPurchased']=data['NumCatalogPurchases']+data['NumStorePurchases']+data['NumWebPurchases']
new_cols.append('TotNumPurchased')
#Spent per purchase, Divide by 0 exception set to 0
data['AvgPerPurchase']=data.apply(lambda x: 0 if x['TotNumPurchased']==0 else x['TotSpent'] / x['TotNumPurchased'], axis=1)
new_cols.append('AvgPerPurchase')
In [193]:
#data[cols_with_accepted].corr()>0.5
In [194]:
data['AvgPerPurchase'].min()
Out[194]:
0.0

Statistics for engineered features.

In [195]:
for col in new_cols:
    print(col)

    print('Skew :', round(data[col].skew(), 2))

    plt.figure(figsize = (15, 4))

    plt.subplot(1, 2, 1)

    data[col].hist(bins = 10, grid = False)

    plt.ylabel('count')

    plt.subplot(1, 2, 2)

    sns.boxplot(x = data[col])

    plt.show()
Age
Skew : 0.09
TotDependents
Skew : 0.42
FamilySize
Skew : 0.18
TotSpent
Skew : 0.86
Longevity
Skew : 0.04
TotAccepted
Skew : 2.43
TotNumPurchased
Skew : 0.3
AvgPerPurchase
Skew : 20.75

Removing one outlier from AvgPerPurchase.

In [196]:
#data[data['AvgPerPurchase']>500]['AvgPerPurchase']=data['AvgPerPurchase'].mean()
data.loc[data['AvgPerPurchase'] > 500, 'AvgPerPurchase']=data['AvgPerPurchase'].mean()
In [197]:
data[data['AvgPerPurchase']>500]['AvgPerPurchase']
Out[197]:
Series([], Name: AvgPerPurchase, dtype: float64)
In [198]:
#data[cols_with_accepted].corr()>0.5
In [199]:
print('Skew :', round(data['AvgPerPurchase'].skew(), 2))

plt.figure(figsize = (15, 4))

plt.subplot(1, 2, 1)

data['AvgPerPurchase'].hist(bins = 10, grid = False)

plt.ylabel('count')

plt.subplot(1, 2, 2)

sns.boxplot(x = data['AvgPerPurchase'])

plt.show()
Skew : 1.3

Insights for Engineered Features¶

  1. Age is roughly normally distributed, as expected from Year_Birth data
  2. Graphs of total number of purchases, total amount purchases, and average amount per purchase all skewed strongly right, with a few outlier customers dominating each of these metrics.
  3. Demographic variables like family size and number of dependents followed a binomial distribution, as did customer longevity.

Important Insights from EDA and Data Preprocessing¶

What are the the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

Data Preparation for Segmentation¶

  • The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
  • Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
  • Plot the correlation plot after we've removed the irrelevant variables
  • Scale the Data

Separating out behavior coluns: We note that cols_with_Mnt and cols_with_accepted are already subsets of behavior. To this we'll add: Recency, all Num__Purchases, NumWebVisitsMonth, and Complain and new columns: AvgPerPurchase,TotSpent ,TotAccepted

In [200]:
#data[cols_with_accepted].corr()>0.5
In [201]:
cols_behavior = cols_with_Mnt+cols_with_accepted+[col for col in data.columns if "Num" in col]+['TotAccepted','Complain','AvgPerPurchase','TotSpent', 'Recency']
print(cols_behavior)
['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'TotNumPurchased', 'TotAccepted', 'Complain', 'AvgPerPurchase', 'TotSpent', 'Recency']

We also need demographic columns: TotDependents, Age, FamilySize,Education, Marital_Status, Income

In [202]:
data_dums=pd.get_dummies(data, drop_first=True)
In [203]:
cols_demo=['TotDependents', 'Age', 'FamilySize', 'Income']+[cols for cols in data_dums.columns if "Status" in cols]+[cols for cols in data_dums.columns if "Education" in cols]
In [204]:
data_behavior=data[cols_behavior]
data_demo=data_dums[cols_demo]
In [205]:
plt.figure(figsize  = (20, 20))

sns.heatmap(data_behavior.corr(), annot = True, cmap = "YlGnBu")

plt.show()

Looking at all the behavior columns, there are some notably strong correlations:

  • TotSpent and MntWines
  • TotSpent and MntMeatProducts
  • TotSpent and NumCatalogPurchases
  • TotSpent and TotNumPurchased

From this, we conclude that wine and meat were more expensive than other products and that more of these items were bought through the catalog than through other means.

  • Catalog purchases and Meat
  • TotNumPurchased and Num purchased through web, catalog, or store.

Meat purchases were mostly done through catalog and total number of purchases did not vary strongly by means (web, catalog, or store).

  • Most of the Mnts were strongly correlated to each other.

Very little substitional effect: big shoppers in one category were also big shoppers in the next category.

On the other hand, Accepted 2, Accepted 3, and Recency, and Complain were all poorly correlated with anything, we will drop them.

In [206]:
data_behavior.drop(columns=['AcceptedCmp2', 'AcceptedCmp3', 'Recency','Complain'], inplace=True)

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Scaling the data¶

In [207]:
scaler=StandardScaler()
data_behavior_scaled=pd.DataFrame(scaler.fit_transform(data_behavior), columns=data_behavior.columns)
data_behavior_scaled.head()
Out[207]:
MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 Response NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth TotNumPurchased TotAccepted AvgPerPurchase TotSpent
0 0.983781 1.551577 1.679702 2.462147 1.476500 0.843207 -0.28383 -0.28014 -0.262111 2.388846 0.349414 1.409304 2.510890 -0.550785 0.693904 1.313544 0.623174 1.201453 1.679417
1 -0.870479 -0.636301 -0.713225 -0.650449 -0.631503 -0.729006 -0.28383 -0.28014 -0.262111 -0.418612 -0.168236 -1.110409 -0.568720 -1.166125 -0.130463 -1.185022 -0.502560 -1.018283 -0.961275
2 0.362723 0.570804 -0.177032 1.345274 -0.146905 -0.038766 -0.28383 -0.28014 -0.262111 -0.418612 -0.685887 1.409304 -0.226541 1.295237 -0.542647 1.035926 -0.502560 0.047523 0.282673
3 -0.870479 -0.560857 -0.651187 -0.503974 -0.583043 -0.748179 -0.28383 -0.28014 -0.262111 -0.418612 -0.168236 -0.750450 -0.910898 -0.550785 0.281720 -0.907403 -0.502560 -0.949003 -0.918094
4 -0.389085 0.419916 -0.216914 0.155164 -0.001525 -0.556446 -0.28383 -0.28014 -0.262111 -0.418612 1.384715 0.329427 0.115638 0.064556 -0.130463 0.203070 -0.502560 -0.240366 -0.305254
In [208]:
data_behavior_scaled.describe().T
Out[208]:
count mean std min 25% 50% 75% max
MntWines 2240.0 -7.612958e-17 1.000223 -0.903167 -0.832592 -0.387599 0.595248 3.533390
MntFruits 2240.0 -2.379049e-17 1.000223 -0.661449 -0.636301 -0.460265 0.168436 4.343008
MntMeatProducts 2240.0 4.123686e-17 1.000223 -0.739813 -0.668912 -0.442913 0.288259 6.904261
MntFishProducts 2240.0 1.506731e-17 1.000223 -0.687068 -0.632140 -0.467355 0.228401 4.055064
MntSweetProducts 2240.0 -1.506731e-17 1.000223 -0.655733 -0.631503 -0.461894 0.143854 5.716737
MntGoldProds 2240.0 -1.110223e-17 1.000223 -0.844046 -0.671486 -0.383886 0.229660 6.096698
AcceptedCmp4 2240.0 5.551115e-17 1.000223 -0.283830 -0.283830 -0.283830 -0.283830 3.523233
AcceptedCmp5 2240.0 -3.013462e-17 1.000223 -0.280140 -0.280140 -0.280140 -0.280140 3.569640
AcceptedCmp1 2240.0 6.264830e-17 1.000223 -0.262111 -0.262111 -0.262111 -0.262111 3.815174
Response 2240.0 -7.295751e-17 1.000223 -0.418612 -0.418612 -0.418612 -0.418612 2.388846
NumDealsPurchases 2240.0 -9.833404e-17 1.000223 -1.203537 -0.685887 -0.168236 0.349414 6.561217
NumWebPurchases 2240.0 -5.709718e-17 1.000223 -1.470368 -0.750450 -0.030532 0.689386 8.248526
NumCatalogPurchases 2240.0 9.516197e-18 1.000223 -0.910898 -0.910898 -0.226541 0.457817 8.670110
NumStorePurchases 2240.0 7.612958e-17 1.000223 -1.781466 -0.858455 -0.243114 0.679896 2.218248
NumWebVisitsMonth 2240.0 -6.344132e-17 1.000223 -2.191381 -0.954831 0.281720 0.693904 6.052291
TotNumPurchased 2240.0 -1.118153e-16 1.000223 -1.740259 -0.907403 -0.074548 0.758307 2.701637
TotAccepted 2240.0 -2.537653e-17 1.000223 -0.502560 -0.502560 -0.502560 0.623174 5.126107
AvgPerPurchase 2240.0 -1.110223e-16 1.000223 -1.242751 -0.810443 -0.256993 0.389004 4.998005
TotSpent 2240.0 3.647876e-17 1.000223 -0.997813 -0.891937 -0.348436 0.730262 3.187435
In [209]:
# Creating copy of the data to store labels from each algorithm
data_behavior_scaled_copy = data_behavior_scaled.copy(deep = True)

Applying PCA¶

In [210]:
# Defining the number of principal components to generate
n = data_behavior_scaled.shape[1]
print("Max number of principal components", n)

# Finding principal components for the data
cols=['PC1','PC2']
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(data_behavior_scaled))
#transformed_data=pd.DataFrame(data_pca, columns=cols)

# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
Max number of principal components 19
In [211]:
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 10))

plt.plot(range(1, n+1), pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')

plt.title("Explained Variances by Components")

plt.xlabel("Number of Components")

plt.ylabel("Cumulative Explained Variance")

plt.show()
In [212]:
# Find the least number of components that can explain more than 70% variance
sum = 0

for ix, i in enumerate(exp_var1):

    sum = sum + i
    print(i, sum)
    if(sum>0.70):
        print("Number of PCs that explain at least 70% variance: ", ix + 1)
        break
0.43315864706841356 0.43315864706841356
0.12136306680420306 0.5545217138726166
0.09061677039937804 0.6451384842719947
0.05372784202593959 0.6988663262979343
0.039727082771396215 0.7385934090693306
Number of PCs that explain at least 70% variance:  5
In [213]:
colsall=[]
for i in range(1,ix+2):
  colsall.append("PC"+ str(i))
print(colsall)
['PC1', 'PC2', 'PC3', 'PC4', 'PC5']
In [214]:
pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:ix+1], 2), index = data_behavior_scaled.columns, columns = colsall)
pc1
Out[214]:
PC1 PC2 PC3 PC4 PC5
MntWines 0.29 0.08 0.18 0.25 0.13
MntFruits 0.23 -0.18 -0.17 -0.21 -0.15
MntMeatProducts 0.28 -0.06 -0.17 -0.02 0.40
MntFishProducts 0.24 -0.18 -0.18 -0.20 -0.17
MntSweetProducts 0.23 -0.16 -0.15 -0.19 -0.29
MntGoldProds 0.20 -0.12 0.11 -0.32 -0.21
AcceptedCmp4 0.10 0.36 0.17 0.48 -0.19
AcceptedCmp5 0.18 0.35 -0.14 0.11 0.02
AcceptedCmp1 0.16 0.32 -0.10 -0.09 -0.40
Response 0.11 0.40 0.03 -0.51 0.23
NumDealsPurchases -0.03 -0.06 0.58 -0.20 0.31
NumWebPurchases 0.20 -0.09 0.45 -0.07 -0.31
NumCatalogPurchases 0.28 -0.07 -0.01 0.01 0.30
NumStorePurchases 0.25 -0.18 0.19 0.26 -0.15
NumWebVisitsMonth -0.20 0.14 0.38 -0.25 -0.08
TotNumPurchased 0.30 -0.15 0.25 0.10 -0.07
TotAccepted 0.19 0.53 0.00 -0.12 -0.06
AvgPerPurchase 0.30 0.02 -0.04 0.01 0.24
TotSpent 0.34 -0.02 0.01 0.06 0.16

Let's look at a plot of the first two of these:

In [215]:
pca2 = PCA(n_components = 2, random_state = 2)
data_pca = pca2.fit_transform(data_behavior_scaled)
transformed_data = pd.DataFrame(data_pca,  columns = cols)
sns.scatterplot(transformed_data, x='PC1', y='PC2')
Out[215]:
<Axes: xlabel='PC1', ylabel='PC2'>

We note that PCA's first two components - comprising 41% and 12% of the variance respectively, of the 72% explained - do produce interesting patterns that we expect to come out in our clustering algorithms.

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?

Observation and Insights: First PCA dimension: largely centered around how much the customer spends overall.

Second PCA dimension: mostly relates to how targettable the customer is to marketing campaigns, this is obviously a highly useful dimension to know.

Third PCA dimension: largely about website usage and responsiveness to deals.

The other PCA dimensions don't have an easily interpretatable meaning beyond specific products being bought (wine vs. meat respectively).

Knowing this, we can start to make sense of our plot above. We note a gap in the lower left corner of the plot for the first 2 PCA components. This corner represents the customers who would be smaller spenders but are also targetable by marketing campaigns. The emptiness of this corner reflects that these customers are not discretionary spenders and can't be recruited by marketing campaigns. The straight lines in this part of the chart reflect that response to campaigns is strongly limited by disposible income.

Applying T-SNE¶

In [216]:
#trying various values of perplexity
for i in range(5, 65, 7):
    tsne = TSNE(n_components = 2, random_state = 2, perplexity = i)

    data_tsne = tsne.fit_transform(data_behavior_scaled)

    data_tsne = pd.DataFrame(data_tsne)

    data_tsne.columns = ['X1','X2']

    plt.figure(figsize = (7,7))

    sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)

    plt.title("perplexity = {}".format(i))

Observation and Insights: Most tSNE work at perplexity > 33 seems to return similiar clustering topology: one large "continent" in the lower X1 values (bound by x2=-5/2*x1+80) and a chain of maybe 3 main islands for larger X2. Let's stick with perplexity =33.

In [217]:
tsne = TSNE(n_components = 2, random_state = 2, perplexity = 33)

data_tsne = tsne.fit_transform(data_behavior_scaled)

data_tsne = pd.DataFrame(data_tsne)

data_tsne.columns = ['X1','X2']

plt.figure(figsize = (7,7))

sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)

plt.title("perplexity = {}".format(33))
Out[217]:
Text(0.5, 1.0, 'perplexity = 33')
In [218]:
# Let's assign points to 4 different groups
def grouping(x):
    first_component = x['X1']

    second_component = x['X2']

    if (second_component < -5/2*first_component + 80):
      return 'group 1'

    elif first_component < 40:
      return 'group_2'

    elif (second_component < -15):
        return 'group_3'

    else:
        return 'group_4'
In [219]:
data_tsne['groups'] = data_tsne.apply(grouping, axis = 1)
In [220]:
# Scatter plot for two components with hue
plt.figure(figsize = (7, 7))

sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue = 'groups')
Out[220]:
<Axes: xlabel='X1', ylabel='X2'>
In [221]:
#data_tsne.head()
data_tsne['groups'].value_counts()
Out[221]:
group 1    1708
group_4     259
group_2     206
group_3      67
Name: groups, dtype: int64

This highly uneven splitting of group is not ideal. Group 1, for instance, contains 76% of all customers.

In [222]:
all_col = data_behavior_scaled.columns[:].tolist()

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)

    sns.boxplot(y=data_behavior_scaled[variable], x=data_tsne['groups'])

    plt.tight_layout()

    plt.title(variable)

plt.show()
In [223]:
# importing plotly
import plotly.express as px
In [224]:
all_col = data_demo.columns[:].tolist()

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)

    sns.boxplot(y=data_demo[variable], x=data_tsne['groups'])
    #px.box(data_demo, y=data_demo[variable], x=data_tsne['groups'], points="all")
    plt.tight_layout()

    plt.title(variable)

plt.show()

The strongest separation seems to happen by number of web visits:

  • Group 2 is Pepe the Pennypincher. He visits the website the most. He spends the least on everything, but is the most responsive to special deals. This is consistent with his income being the lowest.
  • Group 1 is the second largest web user, but also 76% of all customers. He also uses the catalog less and buys less meat than others. Unfortunately, the amount of outliers in Group 1 makes it not very useful for marketing.
  • Group 4 uses the website a lot less. Group 4 is sharply delineated by the people who responded to campaigns, making this person Mike the Marketable. Mike is a top consumer of wine so targeting him is an effective means to clear wine inventory.
  • Group 3 is the most specific cluster and it's Grandpa Simpson. He uses the website the least and likes to come to the store to buy lots of fruits, meat, and sweet things. He buys the most of everything, except for wine, for which the top shopper is Group 4.

K-Means¶

Think About It:

  • How do we determine the optimal K value from the elbow curve?
  • Which metric can be used to determine the final K value?

Applying KMeans on the PCA data and visualize the clusters¶

In [225]:
# Empty dictionary to store the SSE for each value of K
sse = {}

# Iterate for a range of Ks and fit the scaled data to the algorithm.
# Use inertia attribute from the clustering object and store the inertia value for that K
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)

    sse[k] = kmeans.inertia_

# Elbow plotsil
plt.figure()

plt.plot(list(sse.keys()), list(sse.values()), 'bx-')

plt.xlabel("Number of cluster")

plt.ylabel("SSE")

plt.show()

Given the sharp elbow at k=3, 3 clusters seems ideal.

In [226]:
# Empty dictionary to store the Silhouette score for each value of K
sc = {}

# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for k in range(2, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)

    labels = kmeans.predict(data_pca)

    sc[k] = silhouette_score(data_pca, labels)

# Elbow plot
plt.figure()

plt.plot(list(sc.keys()), list(sc.values()), 'bx-')

plt.xlabel("Number of cluster")

plt.ylabel("Silhouette Score")

plt.show()

k=3 is our best clustering for k means.

In [227]:
cols = ['PC1', 'PC2', 'PC3']
pca2 = PCA(n_components = 3, random_state = 1)
data_pca = pca2.fit_transform(data_behavior_scaled)
transformed_data = pd.DataFrame(data_pca,  columns = cols)

Cluster Profiling¶

In [228]:
transformed_data_copy = data_behavior_scaled.copy(deep = True)
In [229]:
kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans_trans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(data_behavior_scaled)
kmeans_trans.fit(transformed_data)
# Adding predicted labels to the original data and the scaled data
data_behavior_scaled_copy['KMeans_Labels'] = kmeans.predict(data_behavior_scaled)
transformed_data_copy['KMeans_Labels'] = kmeans_trans.predict(transformed_data)
data['KMeans_Labels'] = kmeans.predict(data_behavior_scaled)
transformed_data['KMeans_Labels'] = kmeans_trans.predict(transformed_data)
In [230]:
data['KMeans_Labels'].value_counts()
Out[230]:
0    1217
2     803
1     220
Name: KMeans_Labels, dtype: int64
In [231]:
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='KMeans_Labels', palette='Dark2')
Out[231]:
<Axes: xlabel='PC1', ylabel='PC2'>
In [232]:
all_col = data_behavior_scaled.columns[:].tolist()

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)

    sns.boxplot(y=data_behavior_scaled[variable], x=transformed_data['KMeans_Labels'])

    plt.tight_layout()

    plt.title(variable)

plt.show()
In [233]:
all_col = data_demo.columns[:].tolist()

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)

    sns.boxplot(y=data_demo[variable], x=transformed_data['KMeans_Labels'])

    plt.tight_layout()

    plt.title(variable)

plt.show()

Observations and Insights:

k=3 provided 3 clusters of data, but groups 0 and 1 don't split graphically as expected. The most conspicuous split is by income: Groups 0, 1, and 2 are low, medium, and high-income, respectively.

Describe the characteristics of each cluster¶

  • Group 0: lowest income, larger family size, more dependents, lots of webvisits, lower spending across the board
  • Group 1: middle income, medium family size, medium number of dependents, more catalog and store purchases
  • Group 2: highest income, smallest family size and number of dependents, more catalog and store purchases. Also, the only group to respond to campaigns, but doesn't use deals to buy,

Think About It:

  • Are the K-Means profiles providing any deep insights into customer purchasing behavior or which channels they are using?
  • What is the next step to get more meaningful insights?

Summary of each cluster: We see a similiar grouping to what tSNE produced, in that groups are most strongly stratified by web visits and one group most strongly responds to campaingns

K-Medoids¶

In [234]:
kmedo = KMedoids(n_clusters = 3, random_state = 5)
kmedo.fit(transformed_data)

data_behavior_scaled_copy['kmedoLabels'] = kmedo.predict(transformed_data)

data['kmedoLabels'] = kmedo.predict(transformed_data)
transformed_data['kmedoLabels']=kmedo.predict(transformed_data)
In [235]:
data['KMeans_Labels'].value_counts()
Out[235]:
0    1217
2     803
1     220
Name: KMeans_Labels, dtype: int64
In [236]:
data.kmedoLabels.value_counts()
Out[236]:
2    1079
1     582
0     579
Name: kmedoLabels, dtype: int64

Clusters are a bit more balanced than KMeans.

Visualize the clusters using PCA¶

In [237]:
plt.figure(figsize = (15, 4))

plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='kmedoLabels', palette='Dark2')
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='KMeans_Labels', palette='Dark2')
plt.show()

Similiar clustering as before, mostly by income and amount spent, but with KMedoids, the left-most group grew from from 220 to 1079,the right-most group shrank from 802 to 582, while the middle group shrank from 1217 to 579.

In [238]:
for col in  data_behavior.columns:
    sns.boxplot(x = 'kmedoLabels', y = col, data = data)

    plt.show()
In [239]:
all_col = data_demo.columns[:-4].tolist()

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)

    sns.boxplot(y=data_demo[variable], x=transformed_data['kmedoLabels'])

    plt.tight_layout()

    plt.title(variable)

plt.show()

Cluster Profiling¶

In [239]:
 

Observations and Insights:

Characteristics of each cluster¶

Summary for each cluster:

Observations and Insights:

Hierarchical Clustering¶

  • Find the Cophenetic correlation for different distances with different linkage methods.
  • Create the dendrograms for different linkages
  • Explore different linkages with each distance metric
In [240]:
from scipy.cluster.hierarchy import dendrogram, linkage
In [241]:
# The List of all linkage methods to check
methods = ['single',
           'average',
           'complete']

# Create a subplot image
fig, axs = plt.subplots(len(methods), 1, figsize = (20, 15))

# Enumerate through the list of all methods above, get linkage and plot dendrogram
for i, method in enumerate(methods):
    Z = linkage(transformed_data, metric = 'euclidean', method = method)

    dendrogram(Z, ax = axs[i]);

    axs[i].set_title(f'Dendrogram ({method.capitalize()} Linkage)')

    axs[i].set_ylabel('Distance')

Think about it:

  • Single Linkage does not cluster well in this case.

  • Complete Linkage does an okay job, but Average Linkage works the best for clustering separation. Cutting at a distance of 5 gives us 4 cluters.

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining number of clusters based on the dendrogram?
  • Are there any distinct clusters in any of the dendrograms?
In [242]:
plt.figure(figsize = (20, 7))

plt.title("Dendrograms")

dend = dendrogram(linkage(transformed_data, method = 'average'))

plt.axhline(y = 5, color = 'g', linestyle = '--')
Out[242]:
<matplotlib.lines.Line2D at 0x7a251636a860>
In [243]:
# Clustering with 4 clusters
hierarchical = AgglomerativeClustering(n_clusters = 4, affinity = 'euclidean', linkage = 'average')

hierarchical.fit(transformed_data)
Out[243]:
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=4)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=4)
In [244]:
data_behavior_scaled_copy['HCLabels'] = hierarchical.labels_

data['HCLabels'] = hierarchical.labels_
transformed_data['HCLabels'] = hierarchical.labels_
data_demo['HCLabels'] = hierarchical.labels_
In [245]:
data.HCLabels.value_counts()
Out[245]:
3    1077
1     927
0     233
2       3
Name: HCLabels, dtype: int64

Visualize the clusters using PCA¶

In [246]:
plt.figure(figsize = (15, 4))


plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='HCLabels', palette='Dark2')
plt.show()

Results very similar to KMeans and KMedoids, in that we are mostly stratifying by how much the consumers spend. HC also located 3 uniquely targetable very high spenders in their own group in what we're calling "Group 2"

Cluster Profiling¶

In [247]:
for col in data_behavior.columns:
    sns.boxplot(x = 'HCLabels', y = col, data = data)
    plt.show()
In [248]:
mean = round(data.groupby('HCLabels').mean(), 2)

median = round(data.groupby('HCLabels').median(), 2)

df_hierachical = pd.concat([mean, median], axis = 0)

df_hierachical.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']

df_hierachical[data_behavior.columns].T
Out[248]:
group_0 Mean group_1 Mean group_2 Mean group_3 Mean group_0 Median group_1 Median group_2 Median group_3 Median
MntWines 833.01 470.78 26.33 46.64 847.00 418.00 5.0 23.00
MntFruits 62.64 42.03 6.33 4.96 44.00 26.00 1.0 2.00
MntMeatProducts 495.08 251.17 23.33 23.87 490.00 183.00 3.0 15.00
MntFishProducts 88.24 60.06 3.00 7.25 69.00 39.00 1.0 3.00
MntSweetProducts 68.20 42.29 4.67 5.12 51.00 27.00 1.0 2.00
MntGoldProds 74.97 69.64 1.33 15.40 55.00 50.00 1.0 10.00
AcceptedCmp4 0.36 0.07 0.00 0.01 0.00 0.00 0.0 0.00
AcceptedCmp5 0.65 0.01 0.00 0.00 1.00 0.00 0.0 0.00
AcceptedCmp1 0.51 0.03 0.00 0.00 1.00 0.00 0.0 0.00
Response 0.60 0.11 0.00 0.08 1.00 0.00 0.0 0.00
NumDealsPurchases 1.17 2.91 15.00 2.03 1.00 2.00 15.0 2.00
NumWebPurchases 5.40 6.00 0.00 2.16 5.00 6.00 0.0 2.00
NumCatalogPurchases 5.97 4.26 0.33 0.58 6.00 4.00 0.0 0.00
NumStorePurchases 8.02 8.13 0.00 3.31 8.00 8.00 0.0 3.00
NumWebVisitsMonth 3.17 4.63 19.33 6.34 3.00 5.00 19.0 7.00
TotNumPurchased 19.39 18.38 0.33 6.06 19.00 18.00 0.0 6.00
TotAccepted 2.37 0.29 0.00 0.16 2.00 0.00 0.0 0.00
AvgPerPurchase 87.18 51.00 59.33 14.81 82.11 42.33 0.0 12.67
TotSpent 1622.14 935.97 65.00 103.24 1650.00 908.00 9.0 66.00

Group 2 Those 3 people fitting into Group2 Deal David. Deal David is an online window shopper, waiting for a deal. He is 5 times as likely to respond to a deal as anyone in any other category. And while he might spent a lot less overall, his average spent per purchase is more than almost any one else.

Group 0 is Heavy Spender Henry. He buys more of more products than anyone else and he is very targetable with campaigns.

Group 1 is a bit thrifty. She's economical Esther. He spends more than Deal David from Group 2 and makes nearly as many purchases as Heavy Spender Henry, but tends to make cheaper purchases in virtually every category.

Group 3 is the largest group, with 1077 of our customers. This is our Average Joe. Joe does not respond to campaigns, does not buy very many items or spend very much. Joe should be a last priority in marketing campaigns.

In [249]:
for col in data_demo.columns:
    sns.boxplot(x = 'HCLabels', y = col, data = data_demo)
    plt.show()

Observations and Insights:

Characteristics of each cluster¶

  • The very niche Deal David from Group 2 is single and highly educated (up to a PhD but has a very low income.
  • Heavy Spender Henry from Group 0 is high income, married, and has no children.
  • Economical Esther from Group 1 is married or cohabitating, fairly high income and older.
  • Group 3 Average Joe is cohabitating or married, like Esther, is a lower income and a larger family.

Summary of each cluster:

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the problem is determining the best set of hyperparameters to use with it. It includes two hyperparameters, eps, and min samples.

Since it is an unsupervised algorithm, you have no control over it, unlike a supervised learning algorithm, which allows you to test your algorithm on a validation set. The approach we can follow is basically trying out a bunch of different combinations of values and finding the silhouette score for each of them.

Hyperparameter Tuning (eps and min_samples)¶

In [250]:
# Empty dictionary to store the Silhouette score for each value of K
sc = {}

# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for eps in np.arange(0.5, 5.5, 0.5):
    dbs = DBSCAN(eps=eps)

    data_behavior_scaled_copy['DBSLabels'] = dbs.fit_predict(transformed_data)
    data['DBSLabels'] = dbs.fit_predict(transformed_data)
    labels = dbs.fit_predict(transformed_data)
    #print("no of labels", len(np.unique(labels)))
    if len(np.unique(labels))>1:
      sc[eps] = silhouette_score(transformed_data, labels)

# Elbow plot
plt.figure()

plt.plot(list(sc.keys()), list(sc.values()), 'bx-')

plt.xlabel("Eps")

plt.ylabel("Silhouette Score")

plt.show()

Eps=1.5 seems to work best, though anything in the range 1.5-2.5 might be plausible, in case we've hit a saddle point. It might be that eps=1.5 is solely due to default min_samples choice. Let's tune min_samples similarly.

In [251]:
# Empty dictionary to store the Silhouette score for each value of K
sc = {}

# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for min_samples in range(2,20,1):
    dbs = DBSCAN(eps=1.5, min_samples=min_samples)

    data_behavior_scaled_copy['DBSLabels'] = dbs.fit_predict(transformed_data)
    data['DBSLabels'] = dbs.fit_predict(transformed_data)
    labels = dbs.fit_predict(transformed_data)
    #print(silhouette_score(transformed_data, labels), " with min_samples", min_samples)
    #print("no of labels", len(np.unique(labels)))
    if len(np.unique(labels))>1:
      sc[min_samples] = silhouette_score(transformed_data, labels)

# Elbow plot
plt.figure()

plt.plot(list(sc.keys()), list(sc.values()), 'bx-')

plt.xlabel("min_samples")

plt.ylabel("Silhouette Score")

plt.show()

min_samples=5 (which is the default) works well. min_samples=3 works slightly better but does not change clustering appearance (see below). Also, min_samples=3 might be an overfit. (Silh.=0.56)

For eps = 2.5, min_samples=20 works better (Silh. =0.56) For eps = 2, min_samples=5 again works best (Silh.=0.56)

For all the values of eps between 1.5 and 2.5, we have checked for robustness of results in clustering.

Apply DBSCAN for the best hyperparameter and visualize the clusters from PCA¶

In [252]:
dbs = DBSCAN(eps = 1.5, min_samples=5)

data_behavior_scaled_copy['DBSLabels'] = dbs.fit_predict(transformed_data)

data['DBSLabels'] = dbs.fit_predict(transformed_data)
transformed_data['DBSLabels']=dbs.fit_predict(transformed_data)
data_demo['DBSLabels'] = dbs.fit_predict(transformed_data)
In [253]:
data['DBSLabels'].value_counts()
Out[253]:
 1    1077
 0     922
 2     228
-1      13
Name: DBSLabels, dtype: int64

Coincidentally, we also arrive at 4 groups. Let's see how it looks on the first two PCA components.

In [254]:
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='DBSLabels', palette='Dark2')
plt.show()

Observations and Insights: Similar clustering to before, except that the DBSLabel = -1, is pulling 13 customers into their own group for no apparent reason. Because this is only the first two dimensions of PCA, maybe we're missing something in the next dimension, that the DBSCAN algorithim is picking up for this cluster.

In [255]:
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data,  hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data,  hue='DBSLabels', palette='Dark2')
plt.show()

PC3 also shows a similar scattering of this -1 category that does not admit simple explanation.

Characteristics of each cluster¶

In [256]:
mean = round(data.groupby('DBSLabels').mean(), 2)

median = round(data.groupby('DBSLabels').median(), 2)

df_hierachical = pd.concat([mean, median], axis = 0)

df_hierachical.index = ['group_-1 Mean', 'group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_-1 Median', 'group_0 Median', 'group_1 Median', 'group_2 Median']

df_hierachical[data_behavior.columns].T
Out[256]:
group_-1 Mean group_0 Mean group_1 Mean group_2 Mean group_-1 Median group_0 Median group_1 Median group_2 Median
MntWines 595.46 468.15 46.64 838.64 408.00 415.50 23.00 851.50
MntFruits 56.23 42.03 4.96 61.82 14.00 26.00 2.00 43.50
MntMeatProducts 359.00 248.80 23.87 500.89 161.00 181.50 15.00 500.00
MntFishProducts 51.23 60.03 7.25 88.74 12.00 39.00 3.00 69.00
MntSweetProducts 46.69 42.35 5.12 67.78 12.00 26.50 2.00 51.00
MntGoldProds 35.15 69.64 15.40 76.13 22.00 50.00 10.00 56.50
AcceptedCmp4 0.31 0.07 0.01 0.36 0.00 0.00 0.00 0.00
AcceptedCmp5 0.31 0.01 0.00 0.66 0.00 0.00 0.00 1.00
AcceptedCmp1 0.46 0.03 0.00 0.50 0.00 0.00 0.00 0.00
Response 0.62 0.11 0.08 0.59 1.00 0.00 0.00 1.00
NumDealsPurchases 5.31 2.91 2.03 1.15 2.00 2.00 2.00 1.00
NumWebPurchases 4.69 5.99 2.16 5.43 4.00 6.00 2.00 5.00
NumCatalogPurchases 3.62 4.25 0.58 6.00 2.00 4.00 0.00 6.00
NumStorePurchases 5.08 8.13 3.31 8.08 6.00 8.00 3.00 8.00
NumWebVisitsMonth 8.15 4.62 6.34 3.14 7.00 5.00 7.00 3.00
TotNumPurchased 13.38 18.37 6.06 19.50 15.00 18.00 6.00 19.00
TotAccepted 1.92 0.29 0.16 2.33 1.00 0.00 0.00 2.00
AvgPerPurchase 65.32 50.90 14.81 87.68 70.74 42.33 12.67 82.06
TotSpent 1143.77 931.00 103.24 1633.99 1662.00 907.00 66.00 1654.00

Of course, averages will tell us nothing useful about Group -1, because while the points are scattered wildly, their averages will still settle towards centrality on each dimension (application of CLT?)

In [257]:
for col in data_behavior.columns:
    sns.boxplot(x = 'DBSLabels', y = col, data = data)
    plt.show()
In [258]:
for col in data_demo.columns:
    sns.boxplot(x = 'DBSLabels', y = col, data = data_demo)
    plt.show()
In [259]:
mean = round(data_demo.groupby('DBSLabels').mean(), 2)
print(mean)
           TotDependents    Age  FamilySize    Income  Marital_Status_Married  \
DBSLabels                                                                       
-1                  0.46  43.54        2.15  54476.94                    0.23   
 0                  0.80  49.56        2.65  63230.07                    0.39   
 1                  1.24  45.11        3.11  35589.86                    0.39   
 2                  0.21  46.76        2.07  80079.98                    0.39   

           Marital_Status_Single  Marital_Status_Together  \
DBSLabels                                                   
-1                          0.23                     0.23   
 0                          0.20                     0.26   
 1                          0.22                     0.26   
 2                          0.24                     0.22   

           Marital_Status_Widow  Education_Basic  Education_Graduation  \
DBSLabels                                                                
-1                         0.00             0.00                  0.54   
 0                         0.04             0.00                  0.51   
 1                         0.03             0.05                  0.49   
 2                         0.06             0.00                  0.52   

           Education_Master  Education_PhD  HCLabels  
DBSLabels                                             
-1                     0.00           0.38      0.77  
 0                     0.16           0.25      1.00  
 1                     0.17           0.19      3.00  
 2                     0.17           0.24      0.00  

Summary of each cluster: These clusters aren't as illustrative. Again, we get that the highest income group (group 2) (only 228 customers) spends the most, and has the smallest family. Group 2 responds to campaigns, but not specials deals.

Group 1 is the largest, with 1077 and they have lowest income, bigger families, and buy a lot less. They don't respond to campaigns or especially to deals.

Group 0 has 922 customers. They don't respond to campaigns or especially to deals. They are middle income and spent a corresponding amount.

The mysteriously scattered Group -1 is the most educated (PhD) and the youngest. The only thing they seem to have in common is that they respond to deals and campaigns. There are 13 of them.

Gaussian Mixture Model¶

Three clusters have so far seems the best, but let's do one last tuning based on sil. score to confirm this.

In [260]:
sc = {}

# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for k in np.arange(2,8):
    gmm = GaussianMixture(n_components = k, random_state = 1)

    gmm.fit(transformed_data)
    labels = gmm.predict(transformed_data)
    #print("no of labels", len(np.unique(labels)))
    if len(np.unique(labels))>1:
      sc[k] = silhouette_score(transformed_data, labels)

# Elbow plot
plt.figure()

plt.plot(list(sc.keys()), list(sc.values()), 'bx-')

plt.xlabel("k")

plt.ylabel("Silhouette Score")

plt.show()

We confirm that 3 clusters is the best choice for GMM as well. (Silh.=0.56 again)

In [261]:
gmm = GaussianMixture(n_components = 3, random_state = 1)

gmm.fit(transformed_data)

data_behavior_scaled_copy['GmmLabels'] = gmm.predict(transformed_data)
data_behavior['GmmLabels'] = gmm.predict(transformed_data)
data['GmmLabels'] = gmm.predict(transformed_data)
data_demo['GmmLabels'] = gmm.predict(transformed_data)
transformed_data['GmmLabels'] = gmm.predict(transformed_data)
In [262]:
data.GmmLabels.value_counts()
Out[262]:
1    1080
0     925
2     235
Name: GmmLabels, dtype: int64

Observations and Insights:

Visualize the clusters using PCA¶

In [263]:
plt.figure(figsize = (15, 4))


plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data,  hue='GmmLabels', palette='Dark2')
plt.show()
In [264]:
plt.figure(figsize = (15, 4))


plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data,  hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data,  hue='GmmLabels', palette='Dark2')
plt.show()

Cluster 0 shrunk a bit (n=925 here, n=1217 in Kmeans, n=579 in KMedoids, n=927 in HC, n=922 in DBS), but otherwise they look very similiar.

Cluster Profiling¶

In [265]:
mean = round(data_demo.groupby('GmmLabels').mean(), 2)
print(mean)
mean = round(data_behavior.groupby('GmmLabels').mean(), 2)
print(mean)
           TotDependents    Age  FamilySize    Income  Marital_Status_Married  \
GmmLabels                                                                       
0                   0.80  49.54        2.65  63233.64                    0.39   
1                   1.24  45.10        3.11  35503.76                    0.39   
2                   0.21  46.69        2.07  79828.37                    0.39   

           Marital_Status_Single  Marital_Status_Together  \
GmmLabels                                                   
0                           0.20                     0.26   
1                           0.22                     0.26   
2                           0.24                     0.23   

           Marital_Status_Widow  Education_Basic  Education_Graduation  \
GmmLabels                                                                
0                          0.04             0.00                  0.51   
1                          0.03             0.05                  0.49   
2                          0.06             0.00                  0.52   

           Education_Master  Education_PhD  HCLabels  DBSLabels  
GmmLabels                                                        
0                      0.16           0.25      1.00      -0.00  
1                      0.17           0.19      3.00       0.99  
2                      0.16           0.24      0.01       1.91  
           MntWines  MntFruits  MntMeatProducts  MntFishProducts  \
GmmLabels                                                          
0            467.90      41.91           250.21            59.93   
1             46.58       4.97            23.87             7.24   
2            841.26      62.93           496.81            88.50   

           MntSweetProducts  MntGoldProds  AcceptedCmp4  AcceptedCmp5  \
GmmLabels                                                               
0                     42.28         69.64          0.08          0.01   
1                      5.12         15.36          0.01          0.00   
2                     68.01         74.90          0.36          0.66   

           AcceptedCmp1  Response  NumDealsPurchases  NumWebPurchases  \
GmmLabels                                                               
0                  0.03      0.11               2.92             5.98   
1                  0.00      0.08               2.07             2.16   
2                  0.50      0.60               1.17             5.47   

           NumCatalogPurchases  NumStorePurchases  NumWebVisitsMonth  \
GmmLabels                                                              
0                         4.25               8.12               4.62   
1                         0.58               3.30               6.37   
2                         5.98               8.06               3.20   

           TotNumPurchased  TotAccepted  AvgPerPurchase  TotSpent  
GmmLabels                                                          
0                    18.35         0.29           50.88    931.88  
1                     6.04         0.16           14.93    103.13  
2                    19.51         2.35           87.33   1632.41  

Characteristics of each cluster¶

Summary of each cluster:

Observations and Insights: Very similar stories to what we've seen before. Group 2 (n=235) here is almost exactly the same group as DBS 2: highest income, smallest families, highest spenders, more store and catalog purchases, fewer website visits. Not susceptible to deals, but are easily responsive to marketing campaigns.

Group 1 (n=1080) is again the web window shoppers, lowest income, least educated, largest families, smallest spenders, who spent a lot of time on the website without buying very much. They are a bit more likely to respond to deals but not campaigns. This is also the group 1 from DBS.

Group 0 (n=925). Intermediate income, family size, and amounts of spending highly educated, they don't respond to marketing campaigns but are the most responsive to deals.

Conclusion and Recommendations¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • On the metric of silhouette score, all methods are remarkably similiar (0.56) in their success, meaning all methods are equally good at clustering points towards smaller distances towards distances within their cluster and larger distances to points in other clusters.

2. Refined insights:

  • Across the 6 clustering methods, certain consistent patterns emerged:
    • The largest cluster of our customers are lower income, with large families,and have very little disposible income. Campaigns are a waste of money on them, as they do not respond, but they are very responsive to discounts. They window shop online for details but do not actually buy very much.
    • Another cluster that consistently emerges is the higher income groupwith smaller families, who don't care for discounts, but who are highly responsive to campaigns.
  • Money spent on marketing campaigns can be much more effectively spent on the higher income cluster than anyone else, and should especially not be spent on lower income individuals, who don't respond to campaigns.
  • Money spend on discounts can be much more effectively spent on the lower income cluster, and should especially not be spent on higher income individuals, who don't respond to discounts anyway.

3. Proposal for the final solution design:

  • Of all the clustering models, I can most advocate for a mix.
    • tSNE is oddly specific in locating 67 individuals who spent particularly on meat, fruit, and sweet goods.
      • Of the remaining models, Gaussian Mixture identifies 235 high income individuals who respond strongly to campaigns and 925 individuals who respond to discounts, but being middle income, have more money to spend than the lower income customers.
      • DBScan also oddly specifically singles out 13 individuals with nothing in common other that responding to both campigns and discounts.
In [266]:
!pip install nbconvert
Requirement already satisfied: nbconvert in /usr/local/lib/python3.10/dist-packages (6.5.4)
Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from nbconvert) (4.9.4)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (4.12.3)
Requirement already satisfied: bleach in /usr/local/lib/python3.10/dist-packages (from nbconvert) (6.1.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.7.1)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.4)
Requirement already satisfied: jinja2>=3.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (3.1.3)
Requirement already satisfied: jupyter-core>=4.7 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.7.2)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.1.5)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.8.4)
Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.10.0)
Requirement already satisfied: nbformat>=5.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.10.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from nbconvert) (24.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.5.1)
Requirement already satisfied: pygments>=2.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.16.1)
Requirement already satisfied: tinycss2 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.2.1)
Requirement already satisfied: traitlets>=5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.7.1)
Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.10/dist-packages (from jupyter-core>=4.7->nbconvert) (4.2.0)
Requirement already satisfied: jupyter-client>=6.1.12 in /usr/local/lib/python3.10/dist-packages (from nbclient>=0.5.0->nbconvert) (6.1.12)
Requirement already satisfied: fastjsonschema in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert) (2.19.1)
Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert) (4.19.2)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->nbconvert) (2.5)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from bleach->nbconvert) (1.16.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach->nbconvert) (0.5.1)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (23.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (0.34.0)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (0.18.0)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (23.2.1)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (2.8.2)
Requirement already satisfied: tornado>=4.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (6.3.3)
In [ ]:
import nbconvert
import os
from nbconvert import HTMLExporter